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Abstract 

Wc present the implementation and performance of a new gravitational N- 
body tree-code that is specifically designed for the graphics processing unit 
(GPU)^ . All parts of the tree-code algorithm are executed on the GPU. 
We present algorithms for parallel construction and traversing of sparse oc- 
trees. These algorithms are implemented in CUDA and tested on NVIDIA 
GPUs, but they are portable to OpenCL and can easily be used on many- 
core devices from other manufacturers. This portability is achieved by using 
general parallel-scan and sort methods. The gravitational tree-code outper- 
forms tuned CPU code during the tree-construction and shows a performance 
improvement of more than a factor 20 overall, resulting in a processing rate 
of more than 2.8 million particles per second. 
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1. Introduction 

A common way to partition a three-dimensional domain is the use of 
octrees, which recursively subdivide space into eight octants. This structure 
is the three-dimensional extension of a binary tree, which recursively divides 
the one dimensional domain in halves. One can distinguish two types of 
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octrees, namely dense and sparse. In the former, all branches contain an 
equal number of children and the structure contains no empty branches. A 
sparse octree is an octree of which most of the nodes arc empty (like a sparse 
matrix), and the structure is based on the underlying particle distribution. 
In this paper we will only focus on sparse octrees which are quite typical for 
non-homogenous particle distributions. 

Octrees are commonly used in applications that require distance or inter- 
section based criteria. For example, the octree data-structure can be used 
for the range search method [1]. On a set of N particles a range search using 
an octree reduces the complexity of the search from 0{N) to C(logiV) per 
particle. The former, though computationally expensive, can easily be imple- 
mented in parallel for many particles. The later requires more communication 
and book keeping when developing a parallel method. Still for large number 
of particles (~ > 10^) hierarchical^ methods are more efficient than brute 
force methods. Currently parallel octree implementations are found in a wide 
range of problems, among which self gravity simulations, smoothed particle 
hydrodynamics, molecular dynamics, clump finding, ray tracing and voxel 
rendering; in addition to the octree data-structure these problems often re- 
quire long computation times. For high resolution simulations (~ > 10^) 
1 (Central Processing Unit) CPU is not sufficient. Therefore one has to use 
computer clusters or even supercomputers, both of which are expensive and 
scarce. An attractive alternative is a Graphics Processing Unit (GPU). 

Over the years GPUs have grown from game devices into more general 
purpose compute hardware. With the GPUs becoming less specialised, new 
programming languages like Brook [2], CUDA [3] and OpenCL [4] were in- 
troduced and allow the GPU to be used efficiently for non-graphics related 
problems. One has to use these special programming languages in order to 
be able to get the most performance out of a GPU. This can be realized 
by considering the enormous difference between todays GPU and GPU. The 
former has up to 8 cores which can execute two threads each, whereas a mod- 
ern GPU exhibits hundreds of cores and can execute thousands of threads in 
parallel. The GPU can contain a large number of cores, because it has fewer 
resources allocated to control logic compared to a general purpose CPU. 
This limited control logic renders the GPU unsuitable for non-parallel prob- 
lems, but makes it more than an order of magnitude faster than the CPU on 
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massively parallel problems [3]. With the recent introduction of fast double 
precision floating point operations, LI and L2 caches and ECC memory the 
GPU has become a major component in the High Performance Computing 
market. The number of dedicated GPU clusters is steadily increasing and 
the latest generation of supercomputers have nodes equipped with GPUs, 
and have established themselves in the upper regions of the topSOO^. 

This wide spread of GPUs can also be seen in the acceptance of GPUs in 
computational astrophysics research. For algorithms with few data depen- 
dencies, such as direct A^-body simulations, programming the GPU is rela- 
tively straightforward. Here various implementations are able to reach almost 
peak-performance [5, 6, 7] and with the introduction of iV-body libraries the 
GPU has taken over the GRAPE (GRAvity PipE [8]) ^ as preferred com- 
putation device for stellar dynamics [9]. Although it is not a trivial task to 
efficiently utilise the computational power of GPUs, the success with direct 
A'"-body methods shows the potential of the GPU in practice. For algorithms 
with many data dependencies or limited parallelism it is much harder to make 
efficient use of parallel architectures. A good example of this are the gravita- 
tional tree-code algorithms which were introduced in 1986 [10] as a sequential 
algorithm and later extended to make efficient use of vector machines [1 1] . 
Around this time the GRAPE hardware was introduced which made is pos- 
sible to execute direct A'"-body simulations at the same speed as a simulation 
with a tree-code implementation, while the former scales as 0{N^) and the 
latter as 0{N log N). The hierarchical nature of the tree-code method makes 
it difficult to parallelise the algorithms, but it is possible to speed-up the com- 
putational most intensive part, namely the computation of gravitational in- 
teractions. The GRAPE hardware, although unsuitable for constructing and 
traversing the tree-structure, is able to efficiently compute the gravitational 
interactions. Therefore a method was developed to create lists of interact- 
ing particles on the host and then let the GRAPE solve the gravitational 
interactions [12, 13]. Recently this method has successfully been applied to 
GPUs [14, 15, 16]. With the GPU being able to efficiently calculate the force 
interactions, other parts like the tree-construction and tree-traverse become 
the bottleneck of the application. Moving the data intensive tree-traverse to 
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the GPU partially lifts this bottleneck [17, 18]. This method turns out to be 
effective for shared time-step integration algorithms, but is less effective for 
block time-step implementations. In a block time-step algorithm not all par- 
ticles are updated at the same simulation time-step, but only when required. 
This results in a more accurate (less round-off errors, because the reduced 
number of interactions) and more efficient (less unnecessary time-steps) sim- 
ulation. The number of particles being integrated per step can be a fraction 
of the total number of particles which significantly reduces the amount of 
parallelism. Also the percentage of time spent on solving gravitational in- 
teractions goes down and other parts of the algorithm (e.g. construction, 
traversal and time integration) become more important. This makes the hi- 
erarchical tree A^-body codes less attractive, since CPU-GPU communication 
and tree-construction will become the major bottlenecks [7, 17]. One solu- 
tion is to implement the tree-construction on the GPU as has been done for 
surface reconstruction [19] and the creation of bounding volume hierarchies 
[20, 21]. An other possibihty is is to implement all parts of the algorithm 
on the GPU using atomic operations and particle insertions [22] here the 
authors, like us, execute all parts of the algorithm on the GPU. When we 
were in the final stages of finishing the paper we were able to test the imple- 
mentation by Burtscher et al. [22]. It is difficult to compare the codes since 
they have different monopole expansions and multipole acceptance criteria 
(see Sections 3.2 and 3.3). However, even though our implementation has 
higher multipole moments (quadrupole versus monopole) and a more strict 
multipole acceptance criteria it is at least 4 times faster. 

In this work we devised algorithms to execute the tree-construction on the 
GPU instead of on the CPU as is customarily done. In addition we redesign 
the tree-traverse algorithms for effective execution on the GPU. The time 
integration of the particles is also executed on the GPU in order to remove 
the necessity of transferring data between the host and the GPU completely. 
This combination of algorithms is excellently suitable for shared and block 
time-step simulations. Although here implemented as part of a gravitational 
A^-body code (called Bonsai, Section 3), the algorithms are applicable and 
extendable to related methods that use hierarchical data structures. 

2. Sparse octrees on GPUs 

The tree construction and the tree-traverse rely on scan algorithms, which 
can be efficiently implemented on GPUs. (Appendix A). Here we discuss 
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the main algorithms that can be found in all hierarchical methods. Starting 
with the construction of the tree-structure in Section 2.1, followed by the 
method to traverse the previously built tree-structure in Section 2.2. The 
methods that are more specific for a gravitational iV-body tree-algorithm 
are presented in Section 3. 

2.1. Tree construction 

The common algorithm to construct an octree is based on sequential 
particle insertion [10] and is in this form not suitable for massively parallel 
processors. However, a substantial degree of parallelism can be obtained if 
the tree is constructed layer-by-layer ^ from top to bottom. The construction 
of a single tree-level can be efficiently vectorised which is required if one uses 
massively parallel architectures. 

To vectorise the tree construction particles have to be mapped from a 
3-dimensional spatial representation to a linear array while preserving local- 
ity. This implies that particles that are nearby in 3D space also have to be 
nearby in the ID representation. Space filling curves, which trace through 
the three dimensional space of the data enable such reordering of particles. 
The first use of space filling curves in a tree-code was presented by War- 
ren and Salmon (1993) [23] to sort particles in a parallel tree-code for the 
efficient distribution of particles over multiple systems. This sorting also im- 
proves the cache-efficiency of the tree-traverse since most particles that are 
required during the interactions are stored locally, which improves caching 
and reduces communication. We adopt the Morton space filling curve (also 
known as Z-order) [24], because of the existence of a one-to-one map be- 
tween N-dimensional coordinates and the corresponding ID Morton-key. The 
Morton-keys give a ID representation of the original ND coordinate space 
and are computed using bit-based operations (Appendix B). After the keys 
are calculated the particles are sorted in increasing key order to achieve a 
Z-ordered particle distribution in memory. The sorting is performed using 
the radix-sort algorithm (see for our implementation details Appendix A), 
which we selected because of its better performance compared to alternative 
sorting algorithms on the GPU [25, 26]. After sorting the particles have to 
be grouped into tree cells. In Fig. 1 (left panel), we schematically demon- 
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strate the procedure. For a given level, we mask the keys^ of non-flagged 
particles (non-black elements of the array in the figure), by assigning one 
particle per GPU thread. The thread fetches the precomputed key, applies 
a mask (based on the current tree level) and the result is the octree cell to 
which the particle should be assigned. Particles with identical masked keys 
are grouped together since they belong to the same cell. The grouping is 
implemented via the parallel compact algorithm (Appendix A). We allow 
multiple particles to be assigned to the same cell in order to reduce the size 
of the tree-structure. The maximum number of particles that is assigned to a 
cell is Meaf, which we set to A^ieaf = 16. Cells containing up-to A^ieaf particles 
are marked as leaves, otherwise they are marked as nodes. If a particle is 
assigned to a leaf the particle is flagged as complete (black elements of the 
array in the flgure). The masking and grouping procedure is repeated for 
every single level in serial until all particles are assigned to leaves or that 
the maximal depth of the tree is reached, whichever occurs flrst. When all 
particles are assigned to leaves all required tree cells have been created and 
are stored in a continuous array (right panel of Fig. 1). 

However, to complete the tree-construction, the parent cells need to be 
linked to their children. We use a separate function to connect the parent 
and child cells with each other (hnking the tree). This function assigns a 
cell to a single thread which locates its parent and the flrst child if the cell 
is not a leaf (Fig. 2). This is achieved by a binary search of the appropriate 
Morton key in the array of tree-cells, which is already sorted in increasing 
key order during the construction phase. The thread increases the child 
counter of its parent cell and stores the index of the flrst child. To reduce 
memory, we use a single integer to store both the index to the flrst child 
and the number of children (most significant 3 bits). If the cell is a leaf, we 
store the index of the first particle instead of the index of the first child cell, 
together with the number of particles in the leaf. During these operations 
many threads concurrently write data to the same memory location. To 
prevent race conditions, we apply atomic read-modify-write instructions for 
modifying the data. At the end of this step, the octree is complete and can 
be used to compute gravitational attractions. 



^The masking is a bitwise operation that preserves the bits which are specified by the 
mask, for example masking "1011b" with "1100b" results in "1000b". 
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Figure 1: Schematic representation of particle grouping into a tree cell. (Left panel) 
The particles Morton keys are masked^ with the level mask. Next the particles with the 
same masked keys are grouped together into cells (indicated by a thick separator, such 
as in level 1). Cells with less than TVioaf particles are marked as leaves (green), and the 
corresponding particles are flagged (black boxes as in level 2 and 3) and not further used 
for the tree construction. The other cells are called nodes (blue) and their particles are 
used constructing the next levels. The tree construction is complete as soon as all particles 
are assigned to leafs, or when the maximal depth of the tree is reached. (Right panel) The 
resulting array containing the created tree cells. 



2.2. Tree traverse 

To take advantage of the massively parallel nature of GPUs we use breadth 
first, instead of the more common depth first, traversal. Both breadth first 
and depth first can be parallelised, but only the former can be efficiently 
vectorised. To further vectorise the calculation we do not traverse a sin- 
gle particle, but rather a group of particles. This approach is known as 
Barnes' vectorisation of the tree-traverse [11]. The groups are based on the 
tree-cells to take advantage of the particle locality as created during the 
tree-construction. The number of particles in a group is < Ncrit which is 
typically set to 64 and the total number of groups is Ai'groups- The groups 
are associated with a GPU thread-block where the number of threads in a 
block is A^biock with A^biock >= ^crit- Hereby we assume that thousands of 
such blocks are able to run in parallel. This assumption is valid for CUDA- 
enabled GPUs, as well as on AMD Radeon GPUs via OpenCL. If the code is 
executed with shared time-steps, all particles are updated at the same time 
and subsequently all groups are marked as active, for block time-steps this 
number varies between 1 and A^^groups- Each thread block executes the same 
algorithm but on a different set of particles. 
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Figure 2: Schematic illustration of the tree link process. Each cell of the tree, except the 
empty cells which are not stored, is assigned to a GPU thread. The thread locates the first 
child, if it exists, and the parent of the cell. The threads increment the child counter of 
the parent (indicated by the up arrows) and store the first child in the memory of the cell. 
This operation requires atomic read-modify-write operations because threads concurrently 
modify data at the same memory location. 
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Each thread in a block reads particle data that belongs to the correspond- 
ing group as well as group information which is required for the tree traverse. 
If the number of particles assigned to a group is smaller than A^biock by a fac- 
tor of two or more, we use multiple threads (up to 4) per particle to further 
parallelise the calculations. As soon as the particle and group data is read 
by the threads we proceed with the tree-traverse. 

On the CPU the tree-traverse algorithm is generally implemented using 
recursion, but on the GPU this is not commonly supported and hard to 
parallelise. Therefore we use a stack based breadth first tree-traverse which 
allows parallelisation. Initially, cells from one of the topmost levels of the 
tree are stored in the current-level stack and the next-level stack is empty; 
in principle, this can be the root level, but since it consists of one cell, the 
root node, only one thread from A^biock will be active. Taking a deeper level 
prevents this and results in more parallelism. We loop over the cells in the 
current-level stack with increments of A^biock- Within the loop, the cells are 
distributed among the threads with no more than one cell per thread. A 
thread reads the cell's properties and tests whether or not to traverse the 
tree further down from this cell; if so, and if the cell is a node the indexes of 
its children are added to the next-level stack. If however, the cell is a leaf, the 
indexes of constituent particles are stored in the particle-group interaction 
list. Should the traverse be terminated then the cell itself is added to cell- 
group interaction list (Fig.3). 

The cell-group interaction list is evaluated when the size of the list exceeds 
^biock- To achieve data parallelism each thread reads properties of an inter- 
acting cell into fast low-latency on-chip memory that can be shared between 
the threads, namely shared memory (CUDA) or local memory (OpenCL). 
Each thread then progresses over the data in the on-chip memory and ac- 
cumulates partial interactions on its particle. At the end of this pass, the 
size of the interaction list is decreased by A^biock and a new pass is started 
until the size of the list falls below A^biock- The particle-group interaction 
list is evaluated in exactly the same way, except that particle data is read 
into shared memory instead of cell data. This is a standard data-sharing 
approach that has been used in a variety of A'"-body codes, e.g. Nyland et 
al. [27]. 

The tree-traverse loop is repeated until all the cells from the current-level 
stack are processed. If the next-level stack is empty, the tree-traverse is 
complete, however if the next-level stack is non-empty its data is copied to 
the current-level stack. The next-level stack is cleaned and the current-level 
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current stack 



next-level stack 



cell-group list: a,e,f 

particle-group list: Ci,Ci+1 ,Ci+2,Ci+3,Ci+4 

Figure 3: Illustration of a single level tree-traverse. There are five cells in the current 
stack. Those cells which are marked with crosses terminate traverse, and therefore are 
added to the cell-group list for subsequent evaluation. Otherwise, if the cell is a node, its 
children are added to the next-level stack. Because children are contiguous in the tree-cell 
array, they are named as 6i, 6^ -t- 1, . . ., 6^ -|- 4, where bi is the index of the first child of 
the node b. If the cell is a leaf, its particles are added to the particle-cell interaction list. 
Because particles in a leaf are also contiguous in memory, we only need to know the index 
of the first particle, c^, and the number of particles in a leaf, which is 5 here. 
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stack is processed. When the tree-traverse is complete either the cell-group, 

particle-group or both interaction hsts may be non-empty. In such case the 
elements in these lists are distributed among the threads and evaluated in 
parallel. Finally if multiple threads per particle are used an extra reduction 
step combines the results of these threads to get the final interaction result. 

3. Gravitational Tree-code 

To demonstrate the feasibility and performance, we implement a ver- 
sion of the gravitational Barnes-Hut tree-code [10] which is solely based on 
the sparse octree methods presented in the previous sections. In contrast 
to other existing GPU codes [15, 17], our implementation runs entirely on 
GPUs. Apart from the previous described methods to construct and traverse 
the tree-structure we implement time integration (Section 3.1), time-step cal- 
culation and tree-cell properties computation on the GPU (Section 3.2). The 
cell opening method, which sets the accuracy and performance of the tree- 
traverse, is described in Section 3.3. 

3.1. Time Integration 

To move particles forward in time we apply the leapfrog predictor-corrector 
scheme described by Hut et al. (1995) [28]. Here the position and velocity 
are predicted to the next simulation time using previously calculated accel- 
erations. Then the new accelerations are computed (tree-traverse) and the 
velocities are corrected. This is done for all particles in parallel or on a subset 
of particles in case of the block-time step regime. For a cluster of ^ 10^ 
particles, the time required for one prediction-correction step is less than 1% 
of the total execution time and therefore negligible. 

3.2. Tree-cell properties 

Tree-cell properties are a summarized representation of the underlying 
particle distribution. The multipole moments are used to compute the forces 
between tree-cells and the particles that traverse the tree. In this implemen- 
tation of the Barnes-Hut tree-code we use only monopole and quadrupole 
moments [29] . Multipole moments are computed from particle positions and 
need to be recomputed at each time-step; any slowdown in their computation 
may substantially infiuence the execution time. To parallehse this process 
we initially compute the multipole moments of each leaf in parallel. We sub- 
sequently traverse the tree from bottom to top. At each level the multipole 
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Figure 4: Illustration of the computation of the multipole moments. First the properties 
of the leaves are calculated (green circles) . Then the properties of the nodes are calculated 
level-by-level from bottom to top. This is indicated by the numbers in the nodes, first we 
compute the properties of the node with number 1, followed by the nodes with number 2 
and finally the root node. 

moments of the nodes are computed in parallel using the moments of the cells 
one level below (Fig. 4). The number of GPU threads used per level is equal 
to the number of nodes at that level. These computations are performed in 
double precision since our tests indicated that the NVIDIA compiler aggres- 
sively optimises single precision arithmetic operations, which results in an 
error of at most 1% in the multipole moments. Double precision arithmetic 
solved this problem and since the functions are memory-bound^ the overhead 
is less than a factor 2. As final step the double precision values are converted 
back to single precision to be used during the tree-traverse. 

3.3. Cell opening criterion 

In a gravitational tree-code the multipole acceptance criterion (MAC) 
decides whether to use the multipole moments of a cell or to further traverse 
the tree. This influences the precision and execution time of the tree-code. 



^On GPUs we distinguish two kind of performance limitations, memory-bound and 
compute-bound. In the former the performance is limited by the memory speed and 
memory bandwidth, in the later the performance is limited by the computation speed. 
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The further the tree is traversed the more accurate the computed acceleration 
will be. However, traversing the tree further results in a higher execution time 
since the number of gravitational interactions increases. Therefore the choice 
of MAC is important, since it tries to determine, giving a set of parameters, 
when the distance between a particle and a tree cell is large enough that 
the resulting force approximation error is small enough to be negligible. The 
MAC used in this work is a combination of the method introduced by Barnes 
(1994) [30] and the method used for tree-traversal on vector machines [11]. 
This gives the following acceptance criterion, 

d>^ + 5 (1) 

where d is the smallest distance between a group and the center of mass of 
the cell, / is the size of the cell, is a dimensionlcss parameter that controls 
the accuracy and 5 is the distance between the cell's geometrical center and 
the center of mass. If d is larger than the right side of the equation the 
distance is large enough to use the multipole moment instead of traversing 
to the child cells. 

Fig. 5 gives an overview of this method. We also implemented the mini- 
mal distance MAC [31], which results in an acceleration error that is between 
10% and 50% smaller for the same 9 than the MAC used here. The compu- 
tation time, however, is almost a factor 3 higher since more cells are accepted 
(opened). 

The accuracy of the tree-traverse is controlled by the parameter 9. Larger 
values of 9 causes fewer cells to be opened and consequently results in a 
shallower tree-traverse and a faster evaluation of the underlying simulation. 
Smaller values of 9 have the exact opposite effect, but result in a more accu- 
rate integration. In the hypothetical case that all the tree cells arc opened 
{9 — 7> 0) the tree-code turns in an inefficient direct A^-body code. In Sec- 
tion 4.1 we adopt 9 = 0.5 and 9 = 0.75 to show the dependence of the 
execution time on the opening angle. In Section 4.2 we vary 9 between 0.2 
and 0.8 to show the dependence of the acceleration error on 9. 

4. Performance and Accuracy 

In this section we compare the performance of our implementation of 
the gravitational A^-body code (Bonsai) with CPU implementations of com- 
parable algorithms. Furthermore, we use a statistical test to compare the 
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Figure 5: Schematic overview of the muhipole acceptance criterion. We determine if the 
cell has to be opened (continue tree-traversal) for the whole group and therefore take the 
minimal distance d between the group and the center of mass of the node. The cell size 
and the distance between the center of mass and the geometrical center are indicated by 
I and S respectively. The parameters are used in Eq. 1 to determine if the cell has to be 
opened. 
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Table 1: Used hardware. The Xeon is the CPU in the host system, the other five devices 
are CPUs. 



Hardware model 


Xeon E5620 


8800 GTS512 


C1060 


GTX285 


C2050 


GTX180 


Architecture 


Gulftown 


G92 


GT200 


GT200 


GFIOO 


GFIOO 


^ Cores 


4 


128 


240 


240 


448 


480 


Core (Mhz) 


2400 


1625 


1296 


1476 


1150 


1401 


Memory (Mhz) 


1066 


1000 


800 


1243 


1550 


1848 


Interface (bit) 


192 


256 


512 


512 


384 


384 


Bandwidth (GBs) 


25.6 


64 


102 


159 


148 


177.4 


Peak (GFL0Ps)2 


76.8 


624 


933 


1063 


1030 


1345 


Memory size (GB) 


16 


0.5 


4 


1 


2.5 


1.5 



^AU calculations in this work are done in single precision arithmetic. 

^The peak performance is calculated as follows: 

Gulftown: #Cores x Core speed x 8 (SSE flops/cycle) 

G92 & GT200: #Cores x Core speed x 3 (flops/cycle) 

GFIOO: #Cores x Core speed x 2 (flops/cycle) 



accuracy of Bonsai with a direct summation code. As final test Bonsai is 
compared with a direct A^-body code and a set of A'-body tree-codes using 
a production type galaxy merger simulation. 

Even though there are quite a number of tree-code implementations each 
has its own specific details and it is therefore difficult to give a one-to-one 
comparison with other tree-codes. The implementations closest to this work 
are the parallel CPU tree-code of John Dubinski (1996) [32] (Partree) and 
the GPU accelerated tree-code Octgrav [17]. Other often used tree-codes 
either have a different MAC or lack quadrupole corrections. The default 
version of Octgrav has a different MAC than Bonsai, but for the galaxy 
merger simulation a version of Octgrav is used that employs the same method 
as Bonsai (Section 3.3). We use phiGRAPE [33] in combination with the 
Sapporo [9] direct A"-body GPU library for the comparison with direct A"- 
body simulations. Although here used as standalone codes, most of them 
are part of the AMUSE framework [34], as will be a future version of Bonsai 
which would make the comparison trivial to execute. 

The hardware used to run the tests is presented in Table 1. For the CPU 
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calculations we used an Intel Xeon E5620 CPU which has 4 physical cores. 
For CPUs, we used 1 GPU with the G92 architecture (GeForce 8800GTS512), 
2 GPUs with the GT200 architecture (GeForce 285GTX and Tesla C1060), 
and 2 GPUs with the GFIOO architecture (GTX480 and Tesla C2050). All 
these GPUs are produced by NVIDIA. The Tesla C2050 GPU is marketed as a 
professional High Performance Computing card and has the option to enable 
error-correcting code memory (ECC). With ECC enabled extra checks on the 
data are conducted to prevent the use of corrupted data in computations, 
but this has a measurable impact on the performance. Therefore, the tests 
on the C2050 are executed twice, once with and once without ECC enabled 
to measure the impact of ECC. 

All calculations are conducted in single precision arithmetic except for 
the computation of the monopole and quadrupole moments in Bonsai and 
the force calculation during the acceleration test in phi GRAPE for which we 
use double precision arithmetic. 

4.1. Performance 

To measure the performance of the implemented algorithms we execute 
simulations using Plummer [35] spheres with N = 2^^ ( 32k) up to = 
2^4 ( 16M) particles (up to A^ = 2^^ ( 4M) for the GTX480, because of 
memory limitations). For the most time critical parts of the algorithm we 
measure the wall-clock time. For the tree-construction we distinguish three 
parts, namely sorting of the keys (sorting), reordering of particle properties 
based on the sorted keys (moving) and construction and linking of tree-cells 
(tree-construction). Furthermore, are timings presented for the multipole 
computation and tree-traverse. The results are presented in Fig. 6. The 
wall-clock time spend in the sorting, moving, tree-construction and multipole 
computation algorithms scales linearly with A^ for > 10®. For smaller 
A", however, the scaling is sub-linear, because the parallel scan algorithms 
require more than 10^ particles to saturate the GPU. The inset of Fig. 6 
shows that the average number of particle-cell interactions doubles between 
A^ > 32k and A^ < IM and keeps gradually increasing for A^ > IM. Finally, 
more than 90% with 9 = 0.75 and 95% with 9 = 0.5 of the wall-clock time is 
spent on tree-traversal. This allows for block time-step execution where the 
tree-traverse time is reduced by a factor A^groups/A^active; where Aactive is the 
number of groups with particles that have to be updated. 

In Fig. 7 we compare the performance of the tree-algorithms between the 
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Figure 6: The wall-clock time spent by various parts of the program versus the num- 
ber of particles N. We used Plummer models as initial conditions [35] and varied the 
number of particles over two orders of magnitude. The solid black line, which is offset 
arbitrarily, shows the theoretical 0{NlogN) scaling [10]. The asymptotic complexity of 
the tree-construction approaches 0{N), because all the constituent primitives share the 
same complexity. The tree-construction timing comes from the GTX480. To show that 
the linear scaling continues we added timing data for the C1060, which allows usage of 
larger data sets. For the GTX480 we included the results of the tree-traverse with 9 = 0.5 
and the results of the tree-traverse with 9 = 0.75. The inset shows the average number of 
particle-particle and particle-cell interactions for each simulation where 9 — 0.75. 
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three generations of GPUs as well as against tuned CPU implementations^. 
For all algorithms the CPU is between a factor of 2 (data reordering) to 
almost a factor 30 (tree-traverse) slower than the fastest GPU (GTX480). 
Comparing the results of the different GPUs we see that the GTS512 is slow- 
est in all algorithms except for the data moving phase, in which the C1060 
is the slowest. This is surprising since the C1060 has more on-device band- 
width, but the lower memory clock-speed appears to have more influence than 
the total bandwidth. Overall the GFIOO generation of GPUs have the best 
performance. In particular, during the tree-traverse part, they are almost a 
factor 2 faster than the GT200 series. This is more than their theoretical 
peak performance ratios, which are 1.1 and 1.25 for C1060 vs. C2050 and 
GTX285 vs. GTX480 respectively. In contrast, the GTX285 executes the 
tree-traverse faster than the C1060 by a factor of 1.1 which is exactly the 
peak performance ratio between these GPUs. We believe that the difference 
between the GT200 and GFIOO GPUs is mainly caused by the lack of LI and 
L2 caches on GT200 GPUs that are present on GFIOO GPUs. In the latter, 
non-coalesced memory accesses are cached, which occur frequently during 
the tree-traverse, this reduces the need to request data from the relatively 
slow global memory. This is supported by auxiliary tests where the texture 
cache on the GT200 GPUs is used to cache non-coalesced memory reads, 
which resulted in a reduction of the tree-traverse execution time between 20 
and 30%. Comparing the C2050 results with ECC-memory to those without 
ECC-memory we notice a performance impact on memory-bound functions 
that can be as high as 50% (sorting), while the impact on the compute-bound 
tree-traverse is negligible, because the time to perform the ECC is hidden 
behind computations. Overall we find that the implementation scales very 
well over the different GPU generations and makes optimal use of the newly 
introduced features of the GFIOO architecture. The performance of the tree- 
traverse with 6 = 0.75 is 2.1M particles/s and 2.8M particles/s on the C2050 
and GTX480 respectively for N = IM. 



^The tree-construction method is similar to [23], and was implemented by Keigo Nita- 
dori with OpcnMP and SSE support. The trcc-travcrsc is, however, from the CPU version 
of the MPI-parallel tree-code by John Dubinski [32]. It has monopole and quadrupole 
moments and uses the same multipole acceptance criterion as our code. We ran this code 
on the Xeon E5620 CPU using 4 parallel processes where each process uses one of the 4 
available physical cores. 
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Figure 7: Wall-clock time spent by the CPU and different generations of CPUs on various 
primitive algorithms. The bars show the time spent on the five selected sections of code 
on the CPU and 5 different CPUs (spread over 3 generations). The results indicate that 
the code outperforms the CPU on all fronts, and scales in a predictable manner between 
different CPUs. The C2050-ECC bars indicate the runs on the C2050 with ECC enabled, 
the C2050 bars indicate the runs with ECC disabled. Note that the y-axis is in log scale. 
(Timings using a 2^*^ million body Plummer sphere with 9 = 0.75) 
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4-2. Accuracy 

To measure the accuracy of the tree-code we use two tests. In the first, the 

accelerations due to the tree-code are compared with accelerations computed 
by direct summation. In the second test, we compared the performance and 
accuracy of three tree-codes and a direct summation code using a galaxy 
merger simulation. 

4.2.1. Acceleration 

To quantify the error in the accelerations between phiGRAPE and Bonsai 
we calculate 

Aa/a — jatree ~ ^direct I /Indirect I, (2) 

where atree and aairect are accelerations obtained by tree and direct summa- 
tion respectively. The direct summation results are computed with a double 
precision version of Sapporo, while for tree summation single precision is 
used. For both methods the softening is set to zero and a GTX480 GPU is 
used as computation device. 

In Fig. 8 the error distribution for different particle numbers and opening 
angles is shown. Each panel shows the fraction of particles (vertical-axis) 
having a relative acceleration error larger than a given value (horizontal-axis). 
The three horizontal dotted lines show the 50th, 10th and 1st percentile of 
the cumulative distribution (top to bottom). The results indicate that the 
acceleration error is slightly smaller (less than an order of magnitude) than 
Octgrav and comparable to CPU tree-codes with quadrupole corrections 
[36, 37, 38]. In Octgrav a different MAC is used than in Bonsai which 
explains the better accuracy results of Bonsai. 

The dependence of the acceleration error on 6 and the number of particles 
is shown in Fig. 9. Here the median and first percentile of the relative 
acceleration error distributions of Fig. 8 are plotted as a function of 9. The 
figure shows that the relative acceleration error is nearly independent of 
A^, which is a major improvement compared to Octgrav where the relative 
acceleration error clearly depends on N (Figure 5 in [17]). The results are 
consistent with those of Partree [32] which uses the same MAC. 

4-2.2. Galaxy merger 

A realistic comparison between the different A^-body codes, instead of 
statistical tests only, is performed by executing a galaxy merger simulation. 
The merger consists of two galaxies, each with 10^ dark matter particles. 
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Figure 8: Each panel displays a fraction of particles, F{> Aa/a), having a relative ac- 
celeration error, Aa/a, (vertical axis) greater than a specified value (horizontal axis). In 
each panel the solid lines show the errors for various opening angles, from left to right 
6 = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 and 0.8. The panels show, from left to right, simulations 
with A'' = 32768, N = 131072 and N = 1048576 particles. The dotted horizontal lines 
indicate 50%, 10% and 1% of the error distribution. 

2 X 10^ star particles and one super massive black hole (for a total of 240.002 
bodies). The galaxies have a 1:3 mass ratio and the pericenter is lOkpc. The 
merger is simulated with Bonsai, Octgrav, Partree and phiGRAPE. The 
used hardware for Bonsai and Octgrav was 1 GTX480, Partree used 4 cores 
of the Intel Xeon E5620 and phiGRAPE used 4 GTX480 CPUs. With each 
tree-code we ran two simulations, one with 9 — 0.5 and one with 9 — 0.75. 
The end-time of the simulations is T = 1000 with a shared time-step of 
^ (resulting in 64000 steps) and a gravitational softening of e = 0.1. For 
phiGRAPE the default time-step settings were used, with the maximum time- 
step set to ^ and softening set to e = 0.1. The settings are summarised in 
the first four columns of Table 2. 

We compared the density, cumulative mass and circular velocity profiles 
of the merger product as produced by the different simulation codes, but 
apart from slight differences caused by small number statistics the profiles 
are identical. As final comparison we recorded the distance between the 
two black holes over the course of the simulation. The results of which are 
shown in the bottom panel of Fig. 10, the results are indistinguishable up 
to the third pericenter passage ai t — 300 after which the results, because 
of numerical differences, become incomparable. Apart from the simulation 
results we also compare the energy conservation. This is done by computing 
the relative energy error {dE, Eq. 3) and the maximal relative energy error 
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Figure 9: The median and the first percentile of the relative acceleration error distribution 
as a function of the opening angle and the number of particles. We show the lines for 
N = 32768 (bottom striped and solid line) N = 131072 (middle striped and dotted line) 
and N = 1048576 (top striped and dotted line). 
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Table 2: Settings and results of the galaxy merger. The first two columns indicate the 
software and hardware used, the third the time-step (dt) and the fourth the opening 
angle {9). The last three columns present the results, energy error at the time = 1000 
(fifth column), maximum energy error during the simulation (sixth column) and the total 
execution time (seventh column). 
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dE=^ (3) 

Here Eq is the total energy (potential energy + kinetic energy) at the 
start of the simulation and Ef is the total energy at time t. The time is in 
iV-body units. 

The maximal relative energy error is presented in the top panel of Fig. 10, 
the tree-code simulations with 9 — 0.75 give the highest dEmax which occurs 
for both 6 = 0.75 and 6 ~ 0.5 during the second pericenter passage att ^ 280. 
For 6 = 0.5, the dE^ax is roughly a factor 2 smaller than for 9 = 0.75. For 
phiGRAPE dEynax shows a drift and does not stay constant after the second 
pericenter passage. The drift in the energy error is caused by the formation 
of binaries, for which phiGRAPE has no special treatment, resulting in the 
observed drift instead of a random walk. 

A detailed overview of the energy error is presented in Fig. 11 which shows 
the relative energy error {dE) over the course of the simulation. Comparing 
the dE of the tree-codes shows that Bonsai has a more stable evolution than 
Octgrav and Partree. Furthermore if we compare the results of ^ = 0.75 and 
9 = 0.5 there hardly is any improvement visible for Octgrav while Bonsai 
and Partree show an energy error with smaller per time-step variance of the 
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Figure 10: The top panel shows the maximal relative energy error over the course of 
the simulation. The solid lines show the results of Bonsai, the striped lines the results of 
Octgrav and the striped-dotted lines show the results of Partree. In all cases the top lines 
show the result of 9 — 0.75 and the bottom lines the result of 6 — 0.5. Finally the dotted 
line shows the result of phiGRAPE. The bottom panel shows the distance between the two 
supermassive black-holes. The lines themselves have no labels since up to i = 300 the lines 
follow the same path and after t = 300 the motion becomes chaotic and incomparable. 
The distance and time are in A^-body units. 



24 



energy error. 

The last thing to look at is the execution time of the various codes, which 
can be found in the last column of Tab. 2. Comparing Bonsai with Octgrav 
shows that the former is faster by a factor of 1.6 and 1.26 for 9 = 0.75 and 
9 — 0.5 respectively. The smaller speed-up for 9 — 0.5 results from the fact 
that the tree-traverse, which takes up most of the execution time, is faster in 
Octgrav than in Bonsai. Comparing the execution time of Bonsai with that 
of Partree shows that the former is faster by a factor of 17 (24) with 9 = 0.75 
{9 = 0.5). Note that this speed-up is smaller than reported in Section 4.1 
due to different initial conditions. Finally, when comparing phiGRAPE with 
Bonsai, we find that Bonsai completes the simulation on 1 GTX480 faster 
than phiGRAPE, which uses 4 GTX480 CPUs, by a factor of 8.7 {9 = 0.75) 
and 4.9 {9 = 0.5). 

5. Discussion and Conclusions 

We have presented algorithms to efficiently construct and traverse hierar- 
chical data-structures. These algorithms are implemented as part of a gravi- 
tational A'"-body tree-code. In contrast to other existing GPU tree-codes, this 
implementation is executed on the GPU. While the code is written in CUDA, 
the methods themselves are portable to other massively parallel architectures, 
provided that parallel scan-algorithms exist for such architectures. For this 
implementation a custom CUDA API wrapper is used that can be replaced 
with an OpenCL version. As such the code can be ported to OpenCL, by 
only rewriting the GPU functions, which is currently work in progress. 

The number of particles processed per unit time, is 2.8 million particles 
per second with = 0.75 on a GTX480. Combined with the stable energy 
evolution and efficient scaling permits us to routinely carry out simulations 
on the GPU. Since the current version can only use 1 GPU, the limitation is 
the amount of memory. For 5 million particles ±1 gigabyte of GPU memory 
is required. 

Although the the tree-traverse in Octgrav is ±10% faster than in Bonsai, 
the latter is much more appropriate for large (A^ > 10^) simulations and 
simulations which employ block-time steps. In Octgrav the complete tree- 
structure, particle array and multipole moments are send to the GPU during 
each time-step. When using shared-time steps this is a non-critical amount 
of overhead since the overall performance is dominated by the tree-traverse 
which takes up more than 90% of the total compute time. However, this 
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Figure 11: Relative energy error over the course of the simulation. The top panel shows 
the results oi 9 = 0.75 and the bottom panel the results of = 0.5, all other settings as 
defined in Tab. 2. The Bonsai results are shown by the blue lines, the Octgrav results are 
shown by the red lines, the Partree results are shown by the green lines and the black 
lines show the results of phiGRAPE. Note that the result of phiGRAPE is the same in the 
top and bottom panel and that the range of the y-axis is the same in both panels. 
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balance changes if one uses block time-steps. The tree-traverse time is re- 
duced by a factor Ai'gi-oups/A^activc! where A^'acth-c is the number of groups with 
particles that have to be updated. This number can be as small as a few 
percent of iVgroups; and therefore tree-construction, particle prediction and 
communication becomes the bottleneck. By shifting these computations to 
the GPU, this ceases to be a problem, and the required host communication 
is removed entirely. 

Even though the sorting, moving and tree-construction parts of the code 
take up roughly 10% of the execution time, these methods do not have to be 
executed during each time-step when using the block time-step method. It 
is sufficient to only recompute the multipole moments of tree-cells that have 
updated child particles. Only when the tree-traverse shows a considerable 
decline in performance the complete tree-structure has to be rebuild. This 
decline is the result of inefficient memory reads and an increase of the aver- 
age number of particle-cell and particle-particle interactions. This quantity 
increases because the tree-cell size (l) increases, which causes more cells to 
be opened by the multipole acceptance criterion (Eq. 1). 

Although the algorithms described herein are designed for a shared-memory 
architecture, they can be used to construct and traverse tree-structures on 
parallel GPU clusters using the methods described in [23, 32] . Furthermore, 
in case of a parallel GPU tree-code, the CPU can exchange particles with 
the other nodes, while the GPU is traversing the tree-structure of the local 
data. In this way, it is possible to hide most of the communication time. 

The presented tree-construction and tree-traverse algorithms are not lim- 
ited to the evaluation of gravitational forces, but can be applied to a variety of 
problems, such as neighbour search, clump finding algorithms, fast multipole 
method and ray tracing. In particular, it is straightforward to implement 
Smoothed Particle Hydrodynamics in such a code, therefore having a self- 
gravitating particle based hydrodynamics code implemented on the GPU. 
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Appendix A. Scan algorithms 

Both tree-construction and tree-traverse make extensive use of parallel- 
scan algorithms, also known as parallel prefix-sum algorithms. These algo- 
rithms are examples of computations that seem inherently sequential, but 
for which an efficient parallel algorithm can be defined. Blelloch [39] defines 
the scan operations as follows: 

Definition: The all-prefix-sums operation takes a binary associative opera- 
tor ©, and an array of n elements 

[oo, Oi, a„_i]. 



32 



and returns the ordered set 

[ao,{ao © fli), ...,(ao © ai © ... © On-i)]. 

Example: If © is the addition operator, then the all-prefix-sums operation 
on the array 

[3 1 7 4 1 6 3], 

would return 

[3 4 11 11 15 16 22 25]. 

The prefix-sum algorithms form the building blocks for a variety of methods, 
including stream compaction, stream splitting, sorting, regular expressions, 
tree-depth determination and histogram construction. In the following para- 
graphs, we give a concise account on the algorithms we used in this work, 
however we refer the interested readers to the survey by Blelloch [39] for 
further examples and detailed descriptions. 

Appendix A.l. Stream Compaction 

Stream compaction removes "invalid" items from a stream of elements; 
this algorithm is also known as stream reduction. In the left panel of the 
Fig. A. 12 an example of a compaction is shown where invalid elements are 
removed and valid elements are placed at the start of the output stream. 

Appendix A. 2. Split and Sort 

Stream split is related to stream compaction, but differs in that the invalid 
elements are placed behind the valid ones in the output stream instead of 
being discarded (right panel of Fig. A. 12). This algorithm is used as building 
block for the radix sort primitive. Namely, for each bit of an integer we call 
the split algorithm, starting from the least significant bit, and terminating 
with the most significant bit [40] . 

Appendix A. 3. Implementation 

All parts of our parallel octree algorithms use scan algorithms in one 
way or another. Therefore, it is important that these scan algorithms are 
implemented in the most efficient way and do not become the bottleneck of 
the application. There are many different implementations of scan algorithms 
on many-core architectures [41, 42, 43]. We use the method of Billeter et al. 
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Figure A. 12: Example of compact and split algorithms. The "compact" discards invalid 
items whereas the "split" places these behind the valid items. We use this "split" primitive 
for our radix sort implementation because it preserves the item ordering~a property which 
is fundamental for the radix sort algorithm. 



[43] for stream compaction, split and radix-sort because it appears to be, at 
the moment of writing, the fastest and is easily adaptable for our purposes. 
Briefly, the method consists of three steps: 

1. Count the number of valid elements in the array. 

2. Compute output offsets using parallel prefix-sum. 

3. Place valid elements at the output offsets calculated in the previous 
step. 

We used the prefix-sum method described by Sengupta et al. [42], for all 
such operations in both the tree-construction and tree-traverse parts of the 
implementation. 



Appendix B. Morton Key generation 

One of the properties of the Morton key is its direct mapping between 
coordinates and keys. To generate the keys given a set of coordinates one 
can make use of look-up tables or generate the keys directly. Since the 
use of look-up tables is relative inefficient on CPUs (because of the many 
parallel threads that want to access the same memory) we decided to compute 
the keys directly. First we convert the floating point positions into integer 
positions. This is done by shifting the reference frame to the lower left corner 
of the domain and then multiply the new positions by the size of the domain. 
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Then we can apply bit-based dilate primitives to compute the Morton key 
(List.l, [44]). This dilate primitive converts the first 10 bits of an integer to 
a 30 bit representation, i.e. 0100111011^ 000 001 000 000 001 001 001 
000 001 001: 



List 1: The GPU code which we use to dilate the first 10-bits of an integer. 



1 


int di lat e ( cons t int value) { 


2 


unsigned int x; 




3 


X = value k 0x03FF; 




4 


x = ((x << 16) + x) & 


OxFFOOOOFF; 


5 


X = ((x << 8) + x) & 


OxOFOOFOOF; 


6 


X = ((x << 4) + x) & 


0xC30C30C3; 


7 


X = ((x << 2) + x) & 


0x49249249 ; 


8 


return x; 




9 


} 






This dilate primitive 


is combined with bit-shift and OR operators to get 



the particles' key. In our implementation, we used 60-bit keys, which is 
sufficient for an octree with the maximal depth of 20 levels. We store a 
60-bit key in two 32-bit integers, each containing 30-bits of the key. The 
maximal depth imposes a limit on the method, but so far we have never run 
into problems with our simulations. This limitation can easily be lifted by 
either going to 90-bit keys for 30 levels or by modifying the tree-construction 
algorithm when we reach deepest levels. This is something we are currently 
investigating. 
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